LONG-INTEGER MULTIPLIER

The present invention relates to methods and apparatus for the
multiplication of two long integers and the addition of a third long integer,
modulo a fourth long integer. Such multiplications must be carried out
repeatedly during implementation of, for example, cryptographic algorithms in
cryptographic processors such as those used in smart cards.

The increasing use of cryptographic algorithms in electronic devices has 
established a need to quickly and efficiently execute long integer modular 
multiplications. For example, smart cards and many other electronic devices
use a number of cryptographic protocols such as RSA, and others based
on elliptic curve and hyperelliptic curve calculations. All of these protocols have, as
a basic requirement, the ability to perform long integer modular multiplications 
of the form R = X.Y + Z mod N, although the addition of Z is not always 
required. 

Typically, with protocols such as RSA, the long integers X and Y are
1024-bit, or even 2048-bit integers, and the multiplication operations must be
carried out many hundreds or thousands of times to complete an encryption or
decryption operation. It is therefore desirable that the cryptographic devices
that perform these operations execute the long integer multiplications quickly.

An aspect of carrying out such long integer multiplications is to break
down the long integers into a number of words and to successively multiply the
words together in an iterative process which produces a succession of
intermediate results which are cumulated to obtain the final result. A feature of
this technique is the necessity for summing a large number of addends of
various lengths during each stage of the multiplication process. Therefore, the
number of addends for any given bit position can vary significantly.
Conventionally, such summation operations can be implemented using
Wallace trees, but these often make use of rather more hardware, and
introduce rather more delay, than is desirable.

It is an object of the present invention to provide a method and
apparatus for effecting long integer multiplication operations as quickly as
possible.

It is an object of the invention to provide a more efficient method and 
apparatus for the summation of a large number of addends, particularly where 
the number of addend bits varies as a function of the bit position in the sum. 

In one arrangement, an adder circuit for multiplying two long integers
deploys a network of adders for summing a succession of words of the long
integers to generate intermediate results. The number of addends varies as a
function of bit position and the network of adders is designed to reduce the
number of levels of adders in the network according to a maximum number of
expected addends. An object is to adapt the network to include a number of
adders that varies as a function of bit position.

In another arrangement, an output stage may be provided that adds
sum and carry outputs of the network representing an intermediate result. An
objective is to avoid delay in passing a carry bit from this output stage back to
the network, by retaining a most significant (carry) bit for use with a
subsequent calculation output of the network.

In another arrangement, an objective is to enable the network to
commence a subsequent calculation with a new set of addends prior to
completion of the previous calculation. The network of adders may be
configured so that the output of the previous calculation is fed back to the
network at an intermediate level between its highest (input) level and its lowest
(output) level.

According to one aspect, the present invention provides an adder circuit
for summing a plurality of addends from multi-bit words comprising:

a network of n-input carry-save adder circuits each having a first 
number of sum outputs and a second number of carry outputs, 



the adder circuits being arranged in a plurality of columns, each column 
corresponding to a predetermined bit position in the sum, and being arranged 
in a plurality of levels, 

the first level receiving a number of addends from corresponding bit
positions of selected ones of the plurality of words and

the lower levels each receiving addends from one or more of (i) 
corresponding bit positions of other selected ones of the plurality of words, (ii) 
sum outputs from a higher level adder circuit in the same column, and (iii) 
carry outputs from a higher level adder circuit in a column corresponding to a 
less significant bit position, 

wherein the number of n-input adders in each column varies according 
to the bit position. 

According to another aspect, the present invention provides an adder 
circuit comprising: 

an input for receiving a plurality of addends; 

first summation means for summing a plurality of addends to produce 
an output comprising a high order part and a first and second low order part; 

a first feedback line for coupling the first high order part to a lower order 
position at said input, for a subsequent calculation; and 

an output stage including second summation means for summing the 
first and second low order parts to provide a first word output and a feedback 
register for retaining a carry bit from said second summation means and for 
providing said carry bit as input to said second summation means during a 
subsequent calculation. 

According to another aspect, the present invention provides a pipelined 
adder circuit for summing a plurality of addends from multi-bit words 
comprising: 

first summation means comprising a network of carry-save adder 
circuits, the adder circuits being arranged in a plurality of columns, each 
column corresponding to a predetermined bit position in the sum, and being 
arranged in a plurality of levels, the first level coupled for receiving a number of
addends from corresponding bit positions of selected ones of the plurality of
words and the lower levels coupled for receiving addends from one or more of
(i) corresponding bit positions of other selected ones of the plurality of words,
(ii) sum outputs from a higher level adder circuit in the same column, and (iii)
carry outputs from a higher level adder circuit in a column corresponding to a
less significant bit position,

a first feedback line for coupling a first plurality of more significant bit
outputs of the lowest level adder circuits to a corresponding number of less
significant bit inputs of an intermediate level of adder circuits for a subsequent
calculation, the intermediate level being between said first and lowest level
adder circuits.

Embodiments of the present invention will now be described by way of
example and with reference to the accompanying drawings in which:

Figure 1 shows an array multiplier suitable for carrying out the
multiplication operation B.c + r = x.y + c + z, where x and c have a width of 64
bits, while y, z and r have a width of 16 bits;

Figure 2 shows a bit alignment of words to be added in a pipelined
multiplier performing the calculation R_j = x_{n-j-1}.y_0 + z_{n-j-1} +
(x_{n-j-1}.y_1 + r_{j-1,0}).B_y + (x_{n-j-1}.y_2 + r_{j-1,1}).B_y^2 + ... +
(x_{n-j-1}.y_{n-1} + r_{j-1,n-2}).B_y^{n-1} + r_{j-1,n-1}.B_y^n, where each of
the x.y word products is denoted by P_j, split into a number of products,
e.g. P0...P15, together with a sum term denoted by Z;

Figure 3 is a graph showing the number of addends, per bit position, for
the summation of words of figure 2;

Figure 4 shows a fragment of a conventional Wallace tree structure 
suitable for implementing the pipelined summation of words of figure 2; 

Figure 5 shows a fragment of an adaptive tree structure suitable for
implementing the pipelined summation of words of figure 2;

Figure 6 shows a schematic block diagram of an unpipelined adder
suitable for implementing the summation of words of figure 2;



Figure 7 shows a schematic block diagram of a pipelined adder based 
on the structure of the adder of figure 6; 

Figure 8 shows a further fragment of the adaptive tree structure of figure
5, suitable for implementing the pipelined summation of words of figure 2;

Figure 9 shows a portion of an adaptive tree structure according to
figure 5; and

Figure 10 shows the insertion of a number of two-input carry-save
adders into the adaptive tree structure of figure 9.

To calculate the product X.Y + Z mod N where X, Y and Z are long-integer
variables, eg. of the order of 1024 or 2048 bit length, the long-integer
variables X, Y and Z are split into smaller "words" of, for example, 32 or 64 bits
in length.

First, X and Z are split up into n words, generally each of length k, such
that:

X = x_{n-1} B_x^{n-1} + x_{n-2} B_x^{n-2} + ... + x_0, and
Z = z_{n-1} B_x^{n-1} + z_{n-2} B_x^{n-2} + ... + z_0

where B_x = 2^k. In one example, k = 32, and in another example k = 64.

In this manner, X and Z are fragmented into a plurality of words each of
length k bits.

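Then, the result R can be calculated as follows:

R = (...((((x_{n-1} Y + z_{n-1}) mod N) B_x + x_{n-2} Y + z_{n-2}) mod N) B_x + ... + x_0 Y + z_0) mod N

where the innermost reduced term is R_0, the next is R_1, and the complete
expression is R_{n-1}.

Thus, R_j = (x_{n-j-1} Y + z_{n-j-1} + R_{j-1} B_x) mod N.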

First, we multiply x_{n-1} by the complete Y and add z_{n-1}; then we calculate
the modulo N reduction. The result is R_0.

Next, we multiply x_{n-2} by the complete Y, add z_{n-2} and R_0.B_x to the result
and calculate the modulo N reduction. The result is R_1.

Next, we multiply x_{n-3} by the complete Y, add z_{n-3} and R_1.B_x to the result
and calculate the modulo N reduction. The result is R_2.

This procedure is repeated until we have used all words of X, x_0 being
the last word of X to be processed, to obtain the final result R = R_{n-1}.
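
The word-serial recurrence above can be sketched in a few lines of Python. This is only an illustrative model, not the circuit of the invention: the function names `split_words` and `modmul_add`, and the default word size, are assumptions made for the example.

```python
def split_words(value, word_bits, count):
    """Split a non-negative integer into `count` words, least significant first."""
    mask = (1 << word_bits) - 1
    return [(value >> (i * word_bits)) & mask for i in range(count)]


def modmul_add(X, Y, Z, N, k=32, n=32):
    """Return (X*Y + Z) mod N, consuming X and Z one k-bit word at a time.

    Implements R_j = (x_{n-j-1}*Y + z_{n-j-1} + R_{j-1}*B_x) mod N,
    starting from the most significant words of X and Z.
    """
    B_x = 1 << k
    x = split_words(X, k, n)
    z = split_words(Z, k, n)
    R = 0
    for j in range(n):
        i = n - j - 1          # word index used in step j
        R = (x[i] * Y + z[i] + R * B_x) % N
    return R
```

Each pass through the loop corresponds to one R_j; the value left after the word x_0 has been processed is R_{n-1}.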

However, a multiplier for Y being 1024 bits long is undesirable from a
practical viewpoint. Therefore, we also break down Y, and thus R_j, into smaller
"words" of, for example, 32 bits or 16 bits in length.

Therefore, the basic multiplication R_j = (x_{n-j-1} Y + z_{n-j-1} + R_{j-1} B_x) mod N is
also fragmented.

We split Y and R_j into p words of m bits in length, ie. B_y = 2^m:

Y = y_{p-1} B_y^{p-1} + y_{p-2} B_y^{p-2} + ... + y_0
R_j = r_{j,p-1} B_y^{p-1} + r_{j,p-2} B_y^{p-2} + ... + r_{j,0}

For simplicity, we first assume that the lengths of X and Y are the same,
and that the size of the X and Y words are the same, so that p = n and m = k.
Later, we will show what has to be changed when this is not the case.

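In this manner, X and Y are fragmented into n words each of length k
bits. Then,

R_j = x_{n-j-1}.y_0 + z_{n-j-1} + (x_{n-j-1}.y_1 + r_{j-1,0}).B_y + ... + (x_{n-j-1}.y_{n-1} + r_{j-1,n-2}).B_y^{n-1} + r_{j-1,n-1}.B_y^n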

For the calculation of R_j, we perform the following operations:

First, we multiply x_{n-j-1} by y_0, add r_{j-1,-1} = z_{n-j-1} and split the result into
two equal parts: the lower part r_{j,0} (m bits) and the higher part c_{j,0} (k bits):
B.c_{j,0} + r_{j,0} = x_{n-j-1}.y_0 + r_{j-1,-1}. r_{j,0} is saved as part of the outcome.

Next, we multiply x_{n-j-1} by y_1 and add the previous carry word c_{j,0}.
Moreover, we add z_0 = r_{j-1,0} too. The result is again split into two equal parts:
the lower part r_{j,1} and the higher part c_{j,1}:
B.c_{j,1} + r_{j,1} = x_{n-j-1}.y_1 + c_{j,0} + r_{j-1,0}. r_{j,1} is
saved as part of the outcome.

Next, we multiply x_{n-j-1} by y_2 and add the previous carry word c_{j,1}.
Moreover, we add z_1 = r_{j-1,1} too. The result is again split into two equal parts:
the lower part r_{j,2} and the higher part c_{j,2}:
B.c_{j,2} + r_{j,2} = x_{n-j-1}.y_2 + c_{j,1} + r_{j-1,1}. r_{j,2} is
saved as part of the outcome.

This procedure is repeated until we perform the last multiplication, by y_{n-1},
ie. we multiply x_{n-j-1} by y_{n-1} and add the previous carry word c_{j,n-2}. Moreover,
we add z_{n-2} = r_{j-1,n-2} too. The result is again split into 2 parts: the lower
part r_{j,n-1}, of m bits in length, and the higher part c_{j,n-1}, of k bits in length:
B_y.c_{j,n-1} + r_{j,n-1} = x_{n-j-1}.y_{n-1} + c_{j,n-2} + r_{j-1,n-2}. r_{j,n-1} is
saved as part of the outcome.

The last step is the addition of c_{j,n-1} and z_{n-1} = r_{j-1,n-1}:
r_{j,n} = c_{j,n-1} + r_{j-1,n-1}. r_{j,n} is saved as part of the outcome.

Now R_j is complete and is larger than the Y variable from which it was
derived by the length of one word of X. The size of R_j is preferably reduced by
one word in a modulo N reduction, and the reduced result is then used as R_j
during the calculation of the subsequent R_{j+1}.
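
The word-by-word steps above, for the case m = k, can be modelled as follows. This is a sketch under stated assumptions: the names `word_multiply_step` and `words_to_int` are hypothetical, Python integers stand in for the hardware registers, and the modulo N reduction is left out.

```python
def words_to_int(words, k):
    """Recombine words (least significant first) of k bits each into one integer."""
    return sum(w << (i * k) for i, w in enumerate(words))


def word_multiply_step(x, y_words, z, r_prev_words, k=16):
    """Compute x*Y + z + R_prev*B word by word, with B = 2**k.

    Each pass performs the basic operation B.c + r = x.y + c + z:
    the low word r is stored and the high word c is carried forward.
    """
    B = 1 << k
    n = len(y_words)
    out = []
    c = 0
    for i in range(n):
        # First step adds the word of Z; later steps add the previous result,
        # shifted up by one word.
        addend = z if i == 0 else r_prev_words[i - 1]
        t = x * y_words[i] + c + addend
        out.append(t % B)   # lower part r_{j,i}
        c = t // B          # higher part c_{j,i}
    # Last step: add the final carry word and the top word of the previous result.
    # In this model the sum is not truncated, so it may exceed one word.
    out.append(c + r_prev_words[n - 1])
    return out
```

The returned list has one word more than Y, which reflects the statement that R_j is larger than Y by one word before the modulo N reduction.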

The above calculation described the general procedure where the
length of the X words (k) is the same as the length of the Y words (m), ie. B_x =
B_y.

The X words may be different in length than the Y words. For example,
if k/m > 1, eg. k = 64 and m = 16 so that B_x = B_y^4, then:

1. The addition of z is done during the first k/m (= 4, in the example)
multiplications and the addition of R_j starts thereafter.

2. The carry word c_{j,i} is k/m (= 4) times larger (4m bits in length) than the
result r_{j,i} (m bits in length).

3. The last step consists of the addition of the carry word and the
remaining part of R_j, which are both 4m bits wide. This addition might be done
by the same multiplier by choosing y = 0 in k/m steps, where in each step
words of m bits are added.

Thus, in the basic operation, omitting all indices:

B.c + r = x.y + c + z

During the first operation, c = 0. z consists of k/m words of Z followed by
all words of r. During the last k/m operations, y = 0. x is kept constant for the
complete set of operations for each R_j.
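
As an illustration of the mixed word sizes (k = 64, m = 16), the sequence of basic operations can be sketched as below. The function name `mixed_width_step` and the choice to return the final carry separately are assumptions made for this example; the modulo N reduction is again omitted.

```python
def words_to_int(words, m):
    """Recombine m-bit words (least significant first) into one integer."""
    return sum(w << (i * m) for i, w in enumerate(words))


def mixed_width_step(x, y_words, z_word, r_prev_words, k=64, m=16):
    """Compute x*Y + z_word + R_prev*2**k using m-bit steps and a k-bit carry."""
    if len(y_words) != len(r_prev_words):
        raise ValueError("Y and the previous result must have the same word count")
    ratio = k // m
    mask = (1 << m) - 1
    # z inputs: k/m words of the Z word, followed by all words of the previous result.
    z_in = [(z_word >> (i * m)) & mask for i in range(ratio)] + list(r_prev_words)
    # y inputs: all words of Y, followed by k/m steps with y = 0.
    y_in = list(y_words) + [0] * ratio
    c = 0  # c = 0 during the first operation
    out = []
    for y, z in zip(y_in, z_in):
        t = x * y + c + z   # basic operation: B.c + r = x.y + c + z
        out.append(t & mask)
        c = t >> m          # carry word, up to k bits wide
    return out, c
```

With these word sizes x.y + c + z always fits in k + m bits, so the carry word c never exceeds k bits.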

The same multiplier as performs the x.y multiplication can be used for
modulo N reduction. After a complete set of multiplications by a word of X, ie.
x, the result R_j is enlarged by one k-bit word. It must then be reduced by k bits
by modulo N reduction to retrieve the original length prior to computation of the
next R_j.

There are several possible algorithms for modulo reduction (eg.
Quisquater, Barrett, Montgomery, etc), but they all use a multiplication of the
form:

R_j = X_red.N + R_j

where X_red (having a size of k bits) times the modulus N is added to the
result. Alternatively, X_red times the modulus is subtracted by using the two's
complement N' instead of N. The methods differ in the way that the factor X_red
is calculated. For the Montgomery reduction, the result must also be divided by
B_x, ie. the first word (being all zero) is omitted.
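
By way of illustration only, one word of Montgomery reduction can be sketched as follows. The text does not specify how X_red is obtained; the sketch uses the standard textbook choice X_red = (R mod B).(-N^-1 mod B) mod B, which is an assumption here, as is the function name `montgomery_reduce_word`. It requires an odd modulus N and Python 3.8 or later for the modular inverse via `pow`.

```python
def montgomery_reduce_word(R, N, k=16):
    """Remove one k-bit word from R by adding a multiple of N (N must be odd).

    Returns (R_new, x_red) with R_new * 2**k == R + x_red * N, so R_new is
    congruent to R / 2**k modulo N.
    """
    B = 1 << k
    n_prime = (-pow(N, -1, B)) % B       # -N^-1 mod B
    x_red = ((R % B) * n_prime) % B      # factor that zeroes the lowest word
    total = R + x_red * N                # R_j = X_red.N + R_j
    # The lowest word of `total` is now all zero, so it can be omitted.
    return total >> k, x_red
```

The multiplication x_red * N has the same form as the basic operation x.y + c + z, which is why the same multiplier can serve for the reduction.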

The same basic operation can be used for the reduction: 

B.c + r = x.y + c + z 

with B = B_y, r = r_{j,i}, x = X_red, y = N, and z = r_{j,i}.

The above multiplication operations can be carried out in a number of 
possible multipliers. However, an array multiplier is a conventional way of 
implementing such a multiplier. An example is shown in figure 1. 

The exemplary array multiplier 10 is a 64 by 16-bit multiplier, but other
bit configurations can readily be used. The array multiplier 10 calculates each
term in the expression R_j in the form B.c + r = x.y + c + z. x and c have a
width of 64 bits; y, z and r have a width of 16 bits. c, both as input and output,
consists in fact of two terms, Cc and Cs.

The basic element 12 of the array multiplier is shown inset in figure 1
and includes a multiplier 13 receiving inputs x and y, and an adder 14
receiving product terms x.y, carry and sum inputs s_i and c_i to produce carry
and sum outputs c_o and s_o therefrom.

The array multiplier 10 consists of seventeen 'layers' or 'levels', 'add1',
'add2', ... 'add17'. The first sixteen layers add1 ... add16 perform the
multiplication and addition. The last layer, add17 (and the right-most elements
in each layer), performs only additions. The outputs are 16-bit r(15:0) and a
63-bit carry term Cc'(79:16) and a 63-bit sum term Cs'(79:16). The sum of the
carry term Cc' and the sum term Cs' is the carry term c in the calculation:

B.c + r = x.y + c + z. 

In fact, this term is never calculated. Instead, the calculation:

B.(c' + s') + r = x.y + c' + s' + z

is performed. The basic element 12 of the array multiplier 10 performs
the bit calculation (c_o, s_o) = y.x + c_i + s_i. The adding of z is done by the
rightmost adder of every layer except the first one. The seventeenth layer
consists only of adders, which is necessary for the addition of r(15). A
drawback with the use of this implementation of array multiplier is the low
speed at which it can operate, as a result of cumulative delays from seventeen
layers of logic.

Therefore, it is advantageous to use a pipelined multiplier in which the
processing of the various stages can be overlapped to reduce the computation
time. With reference to figure 2, the various addends required during the
multiplication process are shown schematically. For a 64 by 16-bit multiplier,
the process requires the addition of: (i) 16 product terms P_0, P_1, ... P_15 with
P_j = X(63:0).Y(j); (ii) a 16-bit Z term Z(15:0); (iii) a 63-bit carry term Cc(62:0)
and (iv) a 63-bit sum term Cs(62:0).

The result R_j(15:0) is output and the intermediate terms Cc'(78:16),
Cs'(78:16) are carried into the calculation of the next term R_{j+1}.

Figure 3 gives the number of addends per bit position. From bit position
0 through to bit position 15, the number of addends increases linearly from 4
up to 19 as more P terms are included. Then it decreases by 1 for bit 16,
since there are no more z-bits. The number of addends then remains constant
at 18 right through to bit 62 when the carry and sum terms Cc and Cs drop out.
Thus, a reduction in the number of addends by 2 to 16 occurs for bit position
63. Finally, from bit position 63 on up to bit position 78, the number of
addends decreases linearly from 16 down to 1 as each successively higher P
term drops out.
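
The addend profile just described follows directly from the bit ranges of the terms, and can be reproduced with a short sketch. The function name `addend_count` and its parameters are chosen for the example; it assumes P_j occupies bits j to j+63, Z occupies bits 0 to 15, and Cc and Cs occupy bits 0 to 62.

```python
def addend_count(bit, x_bits=64, y_bits=16):
    """Number of addend bits at a given bit position for the 64 by 16-bit case."""
    # P_j = X(63:0).Y(j) is shifted left by j, so it covers bits j .. j+63.
    products = sum(1 for j in range(y_bits) if j <= bit < j + x_bits)
    # Z(15:0) contributes one bit in positions 0 .. 15.
    z_term = 1 if bit < y_bits else 0
    # Cc(62:0) and Cs(62:0) each contribute one bit in positions 0 .. 62.
    carry_and_sum = 2 if bit < x_bits - 1 else 0
    return products + z_term + carry_and_sum
```

For example, `addend_count(15)` gives 19 and `addend_count(16)` gives 18, matching the step down where the z-bits end.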

A Wallace tree is a conventional way of configuring an array of carry- 
save adders for the performance of the addition operations for a large number 
of addends, using an optimised number of levels. Figure 4 shows a fragment 
of such a Wallace tree 40. 

Each adder adds three inputs and has two outputs: a carry and a sum. 
A Wallace tree assumes that the number of addends per bit position is 
constant, and figure 4 shows the configuration of tree 40 that would be 
appropriate for implementing the required additions indicated by figure 3. In 
this case, the tree is configured for 19 addends per bit position, since this
maximum occurs for bit position 15.
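
The behaviour of a single three-input carry-save adder can be illustrated on whole words, where each bit position acts as an independent adder. This is a minimal sketch; the function name `carry_save_add` is chosen for the example.

```python
def carry_save_add(a, b, c):
    """Reduce three addends to a sum word and a carry word without carry propagation.

    For each bit position, the sum bit is the XOR of the three inputs and the
    carry bit is their majority; the carry has the weight of the next higher
    bit position, hence the shift by one.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (a & c) | (b & c)) << 1
    return sum_word, carry_word


s, cy = carry_save_add(0b1011, 0b0110, 0b1101)
print(s, cy)  # 0 30, and 0 + 30 == 11 + 6 + 13
```

The shift by one corresponds to the carry outputs being passed to the column of the next more significant bit position.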

At the first level, indicated as 'layer 1' on the drawing, there are six
carry-save adders 41 for each bit position, eg. bit position j as shown. These
six carry-save adders provide a total of eighteen inputs 42, six sum outputs 43
and six carry outputs 44. Furthermore, there is one additional input 45, which
is added into level 3 ('layer 3'). This gives the required total of nineteen inputs.

The six sum outputs 43 are added in next level 2 by carry-save adders
46. The six carry outputs 44 are added in the next level 2 of the tree but in the
carry-save adders 56 of the next bit position to the left, indicated as j+1. The
carry-save adders 61 of the first level for the preceding bit position j-1 also
provide six carry outputs 64 which are provided to the adders 46 of level 2 for
bit position j. The conventional Wallace tree assumes that the number of carry
inputs (eg. 43, 44) equals the number of carry outputs, which is always the
case when the number of inputs for each bit position at level 1 is the same.

Such a Wallace tree gives the minimum number of levels for a given
number of addends according to the table below:

Number of addends    Number of levels
1, 2, 3              1
4                    2
5, 6                 3
7, 8, 9              4
10-13                5
14-19                6
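
The table can be reproduced by repeatedly applying three-input, two-output adders to a constant number of addends per bit position. The following sketch assumes that every full group of three bits is replaced by two (one sum, plus one carry arriving from the neighbouring column) and that leftover bits pass through; the function name `wallace_levels` is chosen for the example, and a minimum of one level is returned so that the first row of the table is matched.

```python
def wallace_levels(addends):
    """Number of carry-save adder levels needed to reduce `addends` bits to two."""
    levels = 0
    n = addends
    while n > 2:
        # Each group of three becomes a sum bit plus an incoming carry bit;
        # the remaining n % 3 bits are passed to the next level unchanged.
        n = 2 * (n // 3) + n % 3
        levels += 1
    return max(levels, 1)
```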



It has been recognised that particularly - though not exclusively - for
the computations required for the expression R = X * Y + Z mod N discussed
above, the number of adders required for a given number of addends can be
reduced, particularly when the number of addends is variable through the
calculation.

Figure 5 illustrates a section or fragment of the basic structure of an
exemplary 'adaptive tree' or network 70 according to the present invention, for
each of bit number positions j+1, j, and j-1, each bit position corresponding to
a column in the tree. In the fragment of figure 5, the number of addends is 18
in each bit position (column). This basic structure is used for all bit positions,
but the number of carry-save adders at each level and in each bit position is
determined independently according to the number of addends required at that
respective bit position. Figure 8 shows a further section of the adaptive tree
70, specifically for bit positions 0 through to 8, respectively requiring 4 through
to 12 addends (see figure 3). The adaptive tree therefore comprises a tree
structure of adders which is structured to minimise or reduce the number of
adders required where there are variable numbers of input bits for the
respective input bit positions.

The determination of the structure of the adaptive tree or network is
established according to the following rules.

At the first level, the number of carry-save adders 71 in a given bit
position is set to the number of input addends divided by three and rounded
down to the nearest whole number. For example, for sixteen inputs, five
adders are required. For eighteen inputs as illustrated in figure 5, position j,
six adders 71 are required.

At each of the subsequent levels, the number of adders for the given bit
position is determined according to the expression:

(number of adders for bit position j at level n) =
{(number of sum outputs from level n-1 in bit position j) +
(number of unconnected inputs of level n-1 in bit position j) +
(number of carries of level n-1 in bit position j-1)}
divided by 3 and rounded down to the nearest integer.

Thus, referring specifically to figure 5, at an intermediate portion of the
tree 70 requiring eighteen inputs for bit position j, at level 1, the number of
adders 71 is six. At level 2, according to the formulation above, the number of
adders 72 is INT{(6 + 0 + 6) / 3} = 4. At level 3, the number of adders 73 is
INT{(4 + 0 + 4) / 3} = 2. At level 4, the number of adders 74 is INT{(2 + 2 + 2) /
3} = 2. At level 5, the number of adders 75 is INT{(2 + 0 + 2) / 3} = 1. Finally,
for level 6, the number of adders 76 is INT{(1 + 1 + 1) / 3} = 1. It will be noted
that for each of the bit positions j+1, j and j-1, for eighteen addends, there is a
saving of one carry-save adder at level 3 in each bit position.
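
The rules above can be applied mechanically to the addend profile of figure 3. The following sketch is an illustrative model only: the names `addend_count`, `adaptive_tree` and `column` are chosen for the example, only three-input adders are modelled, and an extra column is appended whenever the most significant column produces a carry.

```python
def addend_count(bit, x_bits=64, y_bits=16):
    """Addends per bit position for the 64 by 16-bit case (see figure 3)."""
    products = sum(1 for j in range(y_bits) if j <= bit < j + x_bits)
    return products + (1 if bit < y_bits else 0) + (2 if bit < x_bits - 1 else 0)


def adaptive_tree(addends):
    """Return, per level, the number of three-input adders in each column."""
    pending = list(addends)   # bits waiting to be added in each column
    levels = []
    while True:
        adders = [p // 3 for p in pending]   # divide by 3 and round down
        if not any(adders):
            break
        levels.append(adders)
        nxt = []
        for j, (p, a) in enumerate(zip(pending, adders)):
            carries_in = adders[j - 1] if j > 0 else 0
            # sum outputs + unconnected inputs + carries from column j-1
            nxt.append(a + (p - 3 * a) + carries_in)
        if adders[-1]:
            nxt.append(adders[-1])   # carries out of the top column
        pending = nxt
    return levels


def column(levels, j):
    """Adder count per level for bit position j."""
    return [lvl[j] if j < len(lvl) else 0 for lvl in levels]


levels = adaptive_tree([addend_count(b) for b in range(79)])
print(column(levels, 40)[:6])  # [6, 4, 2, 2, 1, 1]
```

Bit position 40 lies in the region of eighteen addends per column, so it reproduces the counts worked through above for figure 5.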

Referring specifically to figure 8, at one end of the tree 70 further
savings are made, because the number of carries from the right is smaller -
because of the increasing number of input bits - than in the Wallace tree case.
For example, at bit position 7, eleven addends are present. A conventional
Wallace tree would suggest five levels. In fact, in this position, four levels,
respectively having three, two, two and one adder(s) are required.

In some cases the number of levels can be reduced still
further by the addition of a two-input carry-save adder at strategic positions
within the network. First, the design is implemented using only three-input
carry-save adders to form a network 70 according to the strategy defined
above. To identify the strategic positions in which to insert a two-input
carry-save adder, it is necessary to identify, in each level ('L_n') and bit
position ('B_j'), locations where the number of inputs to that bit position B_j
and level L_n exceeds a minimum number, eg. two. Where it does, a two-input
carry-save adder is inserted at a level (eg. L_{n-1} or L_{n-2}, etc) above the
location, at which level there are two unconnected addends. This effectively
moves one input to the next higher order bit position B_{j+1}. This in turn may
result in a consequential exceeding of the allowed number of outputs for the next
bit position and therefore the procedure must be repeated a number of times until
the number of inputs for all bit positions does not exceed the allowed number.

For example, referring specifically to figure 9, there may be a 
decreasing number of inputs for the higher order bits resulting in a higher than 
necessary number of layers. The maximum number of inputs per bit position 
is three, so one level of adders should be sufficient. In figure 9, we have three 
inputs for the adder 100 of bit position 58 and a carry output 101 from an adder 
in bit position 57 (not shown). We have two inputs for each of the adders 102, 
103 of bit positions 59 and 60 respectively, and one input for bit position 61. 
For bit position 59, we have three (instead of the desired two) outputs from 
level 1: one carry output from bit position 58 and two unconnected word inputs. 
Three levels (labelled layer 1, layer 2 and layer 3) are required because of the
carry 101 from bit position 58 to 59 and in the same way from bit 59 to 60. This
gives two additional layers.

With reference to figure 10, we can mitigate this situation by using extra
two-input, two-output adders 110, 111 (labelled as 'CSA2', in contrast to the
three-input, two-output carry save adders, 'CSA3'). Such adders do not
reduce the number of inputs in total, but they do reduce the number for that
bit position by one. The CSA2 adder 110 increases the number of inputs for the
next higher bit position 60 from two to three so the problem is moved to bit
position 60 instead of bit position 59. However, CSA2 adder 111 is also inserted,
which reduces the number of inputs to level 1, bit position 60 from three to two.
The consequent increase in the number of inputs at bit position 61 from one to
two does not matter.

In principle, it has been recognised that strategic handling of pairs of
addends with two-input adders at higher levels in a particular bit position can 
result in a further decrease in the number of levels. In other words, locally 
increasing the summation capacity with two input adders in one or more 
adjacent higher order positions can consequently reduce summation capacity 
required at lower levels, ultimately reducing the number of levels, without 
requiring an additional three-input adder. 

This solution increases the number of addends for a left neighbour
which might, as a result, get too many inputs. If so, a number of two-input
adders may need to be inserted in a level until there is a bit position with a
sufficiently low number of inputs, as shown by bit position 61.

In a general sense, a procedure for inserting additional two-input
carry-save adders may be defined as the following steps. Firstly, for a given
number of levels, find a first location in the network having a bit position B_j
and level L_n where the number of outputs at that first location is greater than
two (eg. three, instead of two) and where at some higher level there are two
unconnected addends. Secondly, in respect of that first location, insert a
two-input carry-save adder at a second location having the same bit position B_j
but having a level (eg. L_{n-1}, L_{n-2}, etc) above the first location, at which
location there are two unconnected addends.

The procedure may need to be repeated a number of times until the 
number of inputs for all bit positions does not exceed the allowed number. 

With reference to figure 6, the adaptive tree may be used in an
unpipelined adder configuration 80. In this arrangement, the adaptive tree has
a maximum of six levels 81, 82 ... 86 for summing all the addends of figure 2.
The adder sums all sixteen products P_0 ... P_15, Z and the fed-back carry term
Cc(62:0) and sum term Cs(62:0) using an adaptive tree of six levels. The
output 87 of the tree is registered, such that the higher order part of final carry
term Cc'(78:16) and the higher order part of final sum term Cs'(78:16) output
are fed back on feedback line 91 and shifted to bit positions (62:0) as input for
the next calculation. The lower order part of carry term Cc'(15:0) and sum
term Cs'(15:0) are summed by an additional full adder 88 and saved to register
89, which is the term 'r' in the formula B.(c'+s') + r = x.y + (c'+s') + z.

This later addition of the lower order parts of carry and sum terms
Cc'(15:0) and Cs'(15:0) itself generates a further single bit carry term,
identified in figure 6 as c"16. This single bit carry term is fed back for inclusion
in the next summation by full adder 88, as indicated by the feedback line 90.

Thus, in a general sense, the additional full adder 88 and register 89
exemplify an output stage which adds the sum and carry terms to provide a first
word output of a final result, and retains a carry bit c"16 to be used as input
for a subsequent stage of the calculation in which the main adder array
generates a further, higher order sum and carry term for addition by the output
stage.

Alternatively, the carry term c"16 could be fed back to level 1, bit 0 of the
adaptive tree as shown at 81, since it has the same weight as Cc'(16) and
Cs'(16). A disadvantage of this technique is that the adaptive tree must wait
for the c"16 output of the full adder 88 before commencing a subsequent
calculation. Therefore it is preferable to use the full adder 88 to add the c"16
term.

The carry bit c"16 is cleared, like Cc' and Cs', at the beginning of each
new multiplication.
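
The interplay between the fed-back high order parts and the retained carry bit can be checked with a small behavioural model. This is a sketch under stated assumptions rather than a description of the circuit: the function name `run_with_output_stage` is hypothetical, and the carry-save output of the tree is modelled by an arbitrary split of the total into two terms (any pair with the same sum would serve).

```python
def run_with_output_stage(x, y_words, z_words, m=16):
    """Model the tree, feedback line and output stage for one word x.

    Returns the output words r, the final high order carry and sum terms,
    and the retained single carry bit of the output stage.
    """
    mask = (1 << m) - 1
    cc = cs = 0        # fed-back carry and sum terms, cleared at the start
    carry_bit = 0      # retained carry of the output stage, also cleared
    r_words = []
    for y, z in zip(y_words, z_words):
        total = x * y + cc + cs + z
        # Stand-in for the carry-save output of the tree (assumption):
        # any two terms that add up to `total`.
        cs_out = total // 3
        cc_out = total - cs_out
        # Output stage: add the low order parts plus the retained carry bit.
        low = (cc_out & mask) + (cs_out & mask) + carry_bit
        r_words.append(low & mask)
        carry_bit = low >> m   # at most one bit, kept for the next word
        # Feedback: the high order parts are shifted down for the next step.
        cc, cs = cc_out >> m, cs_out >> m
    return r_words, cc, cs, carry_bit
```

Because the retained bit has the same weight as bit 0 of the next step, adding it in the output stage gives the same words r as feeding it back into the tree, without making the tree wait for it.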

In a further arrangement, as shown in figure 7, the adaptive tree 180
can be given a pipelined configuration, having a number of levels 181 ... 187. In
this case, it is generally necessary to feed back the higher order part of the
carry Cc'(78:16) and sum Cs'(78:16) to a preceding level (ie. an 'intermediate'
level 185) instead of the first level 181. Thus, in the specific arrangement
shown in figure 7, rather than wait for the higher order part of final carry term
Cc'(78:16) and the higher order part of final sum term Cs'(78:16) output from
the last level 187 to be fed back to level 1 prior to commencement of the next
calculation, these terms can be added in at level 5, as shown. Although this
arrangement increases the number of levels by one, to 7, the delay is reduced
from a six level delay as in the arrangement of figure 6 to a four level delay as
in the arrangement of figure 7.

In this configuration, in a general sense the feedback line 191 couples 
the more significant bit output of the adder circuits to a corresponding number 
of less significant bit inputs of an intermediate level of adder circuits. It may be 
necessary to provide an intermediate level register 191 for temporarily holding 
the summation results from the first four levels 181 ... 184.

This increases the speed of operation by a factor of 1.5 (a six level delay
reduced to a four level delay), at the cost of a significant increase in
hardware. In the example given, an additional 275 registers are required to
service the additional level.

Another advantage of the adaptive tree occurs for pipelined versions. In 
figure 7, most of the adders of the lower order bit numbers, where at most four 
levels are required, are placed in the first four layers, thereby reducing the 
number of registers. By contrast, the Wallace Tree requires these adders to 
be placed in the lower layers. This therefore requires far more level 4 
registers, since the Wallace tree does not reduce the number of inputs in the 
upper levels for the lower bit numbers. 

The arrangement of figure 7 may also include the output stage 
188...190 as described in connection with the output stage 88...90 of the 
arrangement of figure 6. 
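
The combined effect of feeding back the higher order part and of such an output stage can be modelled arithmetically. The following is a minimal sketch, assuming a 16-bit word (suggested by the bit ranges (15:0) and (78:16)) and using a single 3:2 compression to stand in for the whole adder network. The names csa and accumulate and the input values are hypothetical, the 79-bit width of the network is not enforced, and the pipeline levels are not modelled.

```python
WORD = 16                  # word width implied by the (15:0) / (78:16) split
MASK = (1 << WORD) - 1


def csa(a, b, c):
    """3:2 carry-save compression: returns (s, cy) with s + cy == a + b + c."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1


def accumulate(step_inputs):
    """Word-serial accumulation with feedback and a one-bit carry register.

    step_inputs[i] is the (hypothetical) total of new addends entering at
    step i, weighted 2**(WORD * i). Returns the output words, least
    significant first.
    """
    fb_s = fb_c = 0    # high order sum and carry parts from the previous step
    carry = 0          # single bit held in the output stage feedback register
    words = []
    pending = list(step_inputs)
    while pending or fb_s or fb_c or carry:
        x = pending.pop(0) if pending else 0
        s, c = csa(x, fb_s, fb_c)               # stands in for the adder network
        low = (s & MASK) + (c & MASK) + carry   # sum of the two low order parts
        words.append(low & MASK)                # word output for this step
        carry = low >> WORD                     # 0 or 1, kept for the next step
        fb_s, fb_c = s >> WORD, c >> WORD       # fed back at the lower order position
    return words


inputs = [0x12345678, 0xFFFFFFFF, 0x1]
words = accumulate(inputs)
expected = sum(x << (WORD * i) for i, x in enumerate(inputs))
assert sum(w << (WORD * i) for i, w in enumerate(words)) == expected
```

Because the two low order parts are each 16 bits wide, their sum plus the retained bit never exceeds 17 bits, which is why a single-bit register suffices in this model.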

Other embodiments are intentionally within the scope of the 
accompanying claims. 



CLAIMS 
1. An adder circuit (70) for summing a plurality of addends from 
multi-bit words comprising: 

a network of n-input carry-save adder circuits (71...76) each having a 
first number of sum outputs and a second number of carry outputs, 

the adder circuits being arranged in a plurality of columns (j), each 
column corresponding to a predetermined bit position in the sum, and being 
arranged in a plurality of levels, 

the first level receiving a number of addends from corresponding bit 
positions of selected ones of the plurality of words and 

the lower levels each receiving addends from one or more of (i) 
corresponding bit positions of other selected ones of the plurality of words, (ii) 
sum outputs from a higher level adder circuit in the same column, and (iii) 
carry outputs from a higher level adder circuit in a column corresponding to a 
less significant bit position, 

wherein the number of n-input adders in each column varies according 
to the bit position. 

2. The circuit of claim 1 in which the number of n-input adders in 
each column is specifically adapted to the number of addends required for that 
column. 

3. The circuit of claim 1 in which the number of n-input adders in 
each bit position of the first level does not exceed the integer part of the 
number of addends divided by n. 

4. The circuit of claim 1 or claim 3 in which the number of n-input 
adders in each bit position of the lower levels does not exceed the integer part 
of: 

the total of: (a) the number of sum outputs of the n-input adders in a 
higher level and the same column, (b) the number of unconnected inputs from 
a higher level and the same column, and (c) the number of carry outputs from 
a higher level and a column corresponding to a less significant bit position, 
which total is divided by n. 

5. The circuit of claim 4 in which the number of unconnected inputs 
is that of the immediate higher level. 

6. The circuit of claim 4 in which the number of sum outputs is that 
of the immediate higher level. 

7. The circuit of claim 4 in which the number of carry outputs is that 
of the immediate higher level. 

8. The circuit of claim 1 in which n is three, the first number of sum 
outputs is two and the second number of carry outputs is two. 

9. The circuit of claim 1 further including means for delivering each 
one of the plurality of multi-bit words to the network of n-input adders such that 
the number of addends per bit position varies as a function of bit position. 

10. The circuit of claim 1 or claim 4 further including one or more 
(n-1)-input adders placed at selected positions within the network. 

11. The circuit of claim 10 in which the selected positions are 
determined so as to reduce the number of levels required to sum the plurality 
of addends. 

12. The circuit of claim 11 in which the n-input adders are three-input 
adders, the (n-1)-input adders are two-input adders, and in which each 
selected position is determined according to an identified bit position and level 
where the number of outputs would otherwise be greater than two, the 
selected position being at a level above the identified position and in the same 
bit position. 

13. An adder circuit (80) comprising: 
an input for receiving a plurality of addends; 

first summation means (81...86) for summing a plurality of addends to 
produce an output (87) comprising a high order part (Cc'(78:16), Cs'(78:16)) and 
a first and second low order part (Cc'(15:0), Cs'(15:0)); 

a first feedback line (91) for coupling the first high order part to a lower 
order position at said input, for a subsequent calculation; 

an output stage including second summation means (88) for summing 
the first and second low order parts to provide a first word output (89) and a 
feedback register (90, c"16) for retaining a carry bit from said second 
summation means and for providing said carry bit as input to said second 
summation means during a subsequent calculation. 

14. The adder circuit of claim 13 in which the high order part 
comprises a sum term and a carry term fed back to a subsequent calculation. 

15. The adder circuit of claim 13 in which the carry bit (c"16) is used 
at the end of a subsequent calculation of the first and second low order parts 
by the first summation means (81...86). 

16. The adder circuit (80) of claim 13 for summing a plurality of 
addends from multi-bit words in which: 

the first summation means comprises a network of carry-save adder 
circuits (81...86) each having a number of inputs, a number of sum outputs and 
a number of carry outputs, 

the adder circuits being arranged in a plurality of columns, each column 
corresponding to a predetermined bit position in the sum, and being arranged 
in a plurality of levels (81...86), 

the first level (81) coupled for receiving a number of addends from 
corresponding bit positions of selected ones of the plurality of words and 

the lower levels (82...86) coupled for receiving addends from one or 
more of (i) corresponding bit positions of other selected ones of the plurality of 
words, (ii) sum outputs from a higher level adder circuit in the same column, 
and (iii) carry outputs from a higher level adder circuit in a column 
corresponding to a less significant bit position, 

the first feedback line (91) coupling a first plurality of more significant bit 
outputs (87) of the lowest level (86) adder circuits, as said first high order part, 
to a corresponding number of less significant bit inputs of said first level of 
adder circuits at said lower order position. 

17. The adder circuit of claim 13 or claim 15 in which the high order 
part comprises a high order carry term output and a high order sum term 
output, and in which the first low order part comprises a low order carry term 
output and the second low order part comprises a low order sum term output. 

18. A pipelined adder circuit (180) for summing a plurality of addends 
from multi-bit words comprising: 

first summation means (181...187) comprising a network of carry-save 
adder circuits, the adder circuits being arranged in a plurality of columns, each 
column corresponding to a predetermined bit position in the sum, and being 
arranged in a plurality of levels (181...187), the first level (181) coupled for 
receiving a number of addends from corresponding bit positions of selected 
ones of the plurality of words and the lower levels coupled for receiving 
addends from one or more of (i) corresponding bit positions of other selected 
ones of the plurality of words, (ii) sum outputs from a higher level adder circuit 
in the same column, and (iii) carry outputs from a higher level adder circuit in a 
column corresponding to a less significant bit position, 

a first feedback line (191) for coupling a first plurality of more significant 
bit outputs of the lowest level (187) adder circuits to a corresponding number 
of less significant bit inputs of an intermediate level (185) of adder circuits for a 
subsequent calculation, the intermediate level being between said first and 
lowest level adder circuits. 

19. The pipelined adder circuit of claim 18 further including an output 
stage including second summation means for summing first and second low 
order parts respectively comprising a second and a third plurality of less 
significant bit outputs of the lowest level adder circuits to provide a first word 
output and a feedback register for retaining a carry bit from said second 
summation means and for providing said carry bit as input to said second 
summation means during a subsequent calculation. 

20. Apparatus substantially as described herein with reference to the 
accompanying drawings, figures 5 to 11. 

ABSTRACT 

LONG-INTEGER MULTIPLIER 

An adder circuit for multiplying two long integers deploys a network of 
adders for summing a succession of words of the long integers to generate 
intermediate results. The number of addends varies as a function of bit 
position and the network of adders is designed to reduce the number of levels 
of adders in the network according to a maximum number of expected 
addends. A number of strategically placed extra adders may be positioned in 
the network to further reduce the number of levels. An output stage may be 
provided that adds sum and carry outputs of the network and retains a most 
significant bit for use with a subsequent calculation output of the network. The 
network may be configured so that a subsequent calculation by the network 
can commence before the previous calculation has been completed, the output 
of the previous calculation being fed back to the network at an intermediate 
level between its highest (input) level and its lowest (output) level. 

[Figure 5] 